PB93-228724 




THE  FLUENT  ABSTRACT  MACHINE 


THINKING  MACHINES  CORP. 
CAMBRIDGE,  MA 


1987 




U.S.  DEPARTMENT  OF  COMMERCE 
National  Technical  Information  Service 




The  Fluent  Abstract  Machine 

A.G. Ranade, S.N. Bhatt and S.L. Johnsson, Yale University


Thinking Machines Corporation BA87-3

Technical  Report  Series 


REPRODUCED  BY 

U.S.  DEPARTMENT  OF  COMMERCE 
NATIONAL  TECHNICAL 
INFORMATION  SERVICE 
SPRINGFIELD,  VA  22161 


REPORT  DOCUMENTATION  PAGE 



4.  TITLE  AND  SUBTITLE 

The  fluent  abstract  machine 


5.  FUNDING  NUMBERS 

ONR-N00014-86-K-0564 
NSF  MIP-8601885 


6.  AUTHOR(S) 

A.  Ranade,  S.  Bhatt,  and  S.  L.  Johnsson 


7.  PERFORMING  ORGANIZATION  NAME(5)  AND  ADDRESS(ES) 

Thinking  Machines  Corp. 

245  First  Street 
Cambridge, MA 02142-1264


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


TMC-12 


9.  SPONSORING /MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

ONR -- Department of the Navy
The Pentagon, Washington, DC 20350
NSF -- 1800 G Street NW, Washington, DC 21550



14.  SUBJECT  TERMS 

Basic  Algorithms 


15.  NUMBER  OF  PAGES 

22. 





The  Fluent  Abstract  Machine 


Abhiram  G.  Ranade 
Sandeep  N.  Bhatt 
S.  Lennart  Johnsson 

Department  of  Computer  Science 

Yale  University 

New  Haven  CT  06520. 

BA87-3 


Underlying every general programming model is a shared address space. Every
process can potentially access any object in this space in one step. While
this allows tremendous expressive power, it poses an enormous challenge to
the communications hardware. This conflict between ideal programming models
and real architectures has traditionally been resolved by supporting a less
general model which restricts the possible patterns of access.

The Fluent abstract machine supports a very powerful programming model.
In addition to arbitrary access patterns, the instruction repertoire of the Fluent
machine also includes the multiprefix operation and high-level set operations.
The Fluent machine consists of over one hundred thousand processors
interconnected by a butterfly network. The efficiency of the Fluent machine derives
from a very simple router, which effectively eliminates the possibility of
congestion. The routing hardware is extremely simple, inexpensive, and provably
efficient.


1  Introduction 

We envisage building a Fluent machine with over one hundred thousand processors.
Except for highly structured computations, such a large computer must necessarily
spend a good deal of time communicating messages between its processors. As long
as the total communication time does not swamp the total computation time, high
performance is guaranteed.

Large parallel computers are also difficult to program. The situation becomes
intolerable if the programmer must explicitly manage communication between
processors. For this reason it is necessary to have a powerful programming model
(abstract machine) which abstracts away concerns not directly relevant to the problem
being solved. For overall performance, the abstract machine must be efficiently
supported on the underlying machine.

Of the programming models proposed thus far, shared memory models have been
the most attractive. The most general shared memory models in the literature, the
concurrent-read concurrent-write parallel random-access machines (CRCW PRAMs),
allow an arbitrary number of processors to read or write a common memory location
in one time step. Complex communications operations, broadcast and multicast for
example, can be implemented in one step. Abstracting complex communications
patterns into unit steps greatly simplifies the tasks of designing algorithms and
writing programs. For this reason, CRCW PRAM models are favored over weaker
abstract machine models for which most, if not all, of the programming effort is
spent synchronizing the movement of data.

How  do  we  implement  a  shared  memory  model  on  a  machine  with  processors 
and  memories  distributed  throughout  an  interconnection  network?  The  solution  is 
to  devise  an  efficient  router  which  emulates  shared  memory  operations  and  hides 
details  of  the  communications  network  from  the  user.  This  is  precisely  what  recent 
machines  such  as  the  Thinking  Machines  Corporation’s  Connection  Machine  [8,9], 
the  BBN  Butterfly  [2]  and  Monarch,  the  IBM  RP3  [13],  and  the  NYU  Ultracomputer 
[6]  aim  to  achieve. 

These machines emulate abstract machines of varying generality and power. The
Connection Machine CM2 has hardware support for concurrent read as well as
concurrent write operations with combination. The Connection Machine and the NYU
Ultracomputer/RP3 efficiently support the scan operation [4]. The Ultracomputer
and RP3 also support the fetch-and-add operation, but the switching hardware is
expensive and experiments reveal poor performance because of "hot spots" [11,14].
It thus becomes difficult to argue that the abstract machine operations are
performed in unit time.

The  Fluent  abstract  machine  subsumes  each  of  the  abstract  machines  mentioned 
above.  In  fact,  the  multiprefix  primitive  of  Fluent  requires  arbitrarily  many  primitive 
operations  on  the  other  abstract  machines.  The  Fluent  instruction  set  also  includes 
basic  set  operations.  With  its  rich  instruction  set,  the  Fluent  abstract  machine  is 
readily  suited  as  an  intermediate  language  for  compiling  very  high  level  languages. 

The Fluent abstract machine can be supported efficiently and inexpensively in
hardware. The heart of the Fluent machine is the router, which is based on the
recent work of Ranade [16]. In contrast with the Ultracomputer and RP3, the
hardware requirements are minimal. More importantly, we can prove that each
Fluent instruction is implemented quickly. This justifies our thesis that large Fluent
machines will be less expensive, faster and easier to program than existing parallel
machines.

The remainder of this extended abstract is organized as follows. Section 2
describes the Fluent abstract machine and contrasts it with other models. Section
3 outlines the implementation of the abstract machine on the butterfly network.
Section 4 outlines a design for the routing switch. Section 5 describes the Fluent
machine, presents results of timing simulations, and cost and performance estimates.
Section 6 concludes with some of the important research issues that need further
study, and outlines our ongoing work.

2 The Fluent Abstract Machine

This  section  describes  the  primitive  instructions  of  the  Fluent  abstract  machine,  and 
contrasts  the  Fluent  programming  model  with  other  models.  In  later  sections  we 
show  how  every  instruction  is  supported  efficiently  in  hardware.  As  a  consequence, 
the  time-complexity  of  a  Fluent  program  can  be  easily  estimated  as  the  maximum 
number  of  primitive  instructions  executed  by  one  processor. 

The Fluent abstract machine has N (virtual) processors indexed 1, 2, ..., N
which are connected to a shared memory holding M variables indexed 1, 2, ..., M.
The processors of the abstract machine operate synchronously in discrete time
cycles. Every primitive instruction is executed in one time cycle; executing an
instruction at time T (in the Tth time cycle) has the effect of changing the state that
existed at the start of time cycle T.

The Fluent abstract machine is characterized by two types of primitives:
multiprefix and set operations. The multiprefix operation is a fully general prefix
operation and subsumes the fetch-and-op primitive on the NYU Ultracomputer [7],
as well as the scan operation on the Connection Machine [4]. Set operations are not
supported as primitives on these machines. With its primitive set operations, the
Fluent machine can be programmed at a very high level of abstraction.


2.1  The  Multiprefix  Operation 

The multiprefix operation has the form MP(A, v, ⊗) where A is a shared variable,
v is a value, and ⊗ is a binary associative operator. At any time step a processor
can execute a multiprefix operation, with the constraint that if P_i and P_j execute
MP(A, v_i, ⊗_i) and MP(A, v_j, ⊗_j), then ⊗_i = ⊗_j. The semantics of the multiprefix
operator is as follows:

    At time T let P_A = {p_1, ..., p_k} be the set of processors referring to
    variable A, such that p_1 < p_2 < ... < p_k. Suppose that p_i ∈ P_A executes
    instruction MP(A, v_i, ⊗). Let a_0 be the value of A at the start of time
    T. Then, at the end of time cycle T, processor p_i will receive the value
    a_0 ⊗ v_1 ⊗ ... ⊗ v_{i-1} and the value of variable A will be
    a_0 ⊗ v_1 ⊗ ... ⊗ v_k.

Thus, when a set of processors perform a multiprefix operation on a common
variable, the result is the same as if a single prefix operation were performed with the
processors ordered by their index. For example, suppose that processors numbered
25, 32 and 65 execute the instructions MP(A, 4, +), MP(A, 7, +) and MP(A, 11, +)
respectively at time T, and suppose that variable A initially contains the value 5.
Then, at the end of the Tth cycle, processor 25 will receive 5, processor 32 will
receive 9, processor 65 will receive 16, and the variable A will equal 27.
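
The semantics above can be checked with a small sequential simulation. The
following Python sketch is not part of the Fluent design; the function name
multiprefix and the dictionary used as shared memory are illustrative
assumptions. It serves the requests of one time cycle in increasing processor
index, which is the serial order the definition prescribes, and replays the
example with processors 25, 32 and 65.

```python
from typing import Callable, Dict, List, Tuple


def multiprefix(memory: Dict[str, int],
                requests: List[Tuple[int, str, int]],
                op: Callable[[int, int], int]) -> Dict[int, int]:
    """Sequentially simulate one time cycle of MP requests.

    requests holds (processor_index, variable, value) triples, one per
    processor; all requests are assumed to use the same associative op.
    Returns {processor_index: value received}.
    """
    replies = {}
    # Serving requests in increasing processor index gives the
    # "as if ordered by index" behaviour of the multiprefix.
    for proc, var, value in sorted(requests):
        replies[proc] = memory[var]            # prefix accumulated so far
        memory[var] = op(memory[var], value)   # fold this value into A
    return replies


memory = {"A": 5}
replies = multiprefix(memory,
                      [(65, "A", 11), (25, "A", 4), (32, "A", 7)],
                      lambda a, b: a + b)
print(replies, memory["A"])
```

The requests are deliberately listed out of order to show that the result
depends only on processor indices, not on arrival order.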

The fetch-and-⊗ operation [7] also calculates a set of prefixes, but the order of
inputs is undetermined before execution. Multiprefix is a determinate
implementation of the fetch-and-⊗, and is more powerful. The scan operation [4] is a special
case of the multiprefix, one in which the set of participating processors includes all
N processors. Scan does not allow multiple prefixes over all collections of disjoint
subsets, whereas the multiprefix does.

For convenience we include two more primitives: READ and WRITE. READ(A)
returns the value of A to the requesting processor. WRITE(A, v, ⊗) is equivalent
to MP(A, v, ⊗) except that no value is returned to the processor executing the
instruction. Both operations are special cases of multiprefix, as has been observed
earlier [7].

2.2  A  Fast  Radix  Sort  using  Multiprefix 

In this section we present a radix sort based on the multiprefix instruction. The
program is considerably simpler than Batcher's bitonic sort [1] and comparable in
performance when the number of keys is very large.

When each key to be sorted is less than log N bits in size, fetch-and-add can
be used to sort N keys in a constant number of steps. Unfortunately, this idea
cannot be used iteratively to sort longer keys because the fetch-and-add, being
non-deterministic, is not stable [4].

With the multiprefix we can implement a stable iterative radix sort. As we show
below, N keys, each k log N bits long, can be sorted in O(k) Fluent instructions.
When k itself is small, the number of Fluent instructions executed is constant. In
contrast, no other programming model supports such a concise sort even for short
keys.

Theorem 1 N keys, each of size k log N, can be sorted in O(k) steps on the Fluent
abstract machine.

Proof. We first describe a stable scheme for N keys of length log N, one key per
processor. The total number of distinct key values is N. Below we give the program
for each processor. The keys to be sorted are stored in an array KEY. The idea
is to first count the number of occurrences¹ of KEY[i] that lie in processors indexed
less than i, then add to that the cumulative sum of the counts for keys less than
KEY[i].


SHORTSORT:

    COUNT[*] := 0

    CUMULATIVE[*] := 0

    TEMP := 0

    MP(COUNT[KEY[*]], 1, +)

    CUMULATIVE[*] := MP(TEMP, COUNT[*], +)

    return MP(CUMULATIVE[KEY[*]], 1, +)
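
The three multiprefix steps of SHORTSORT can be traced with a sequential
simulation. This Python sketch rests on assumptions that are not in the
report: processors and key values are numbered from 0, and the helper mp_add
stands in for one cycle of MP(..., +) by serving requests in processor order.

```python
def mp_add(cells, requests):
    """One cycle of MP(cell, value, +).

    requests are (processor, cell_index, value) triples, served in
    increasing processor index; returns {processor: value received}.
    """
    replies = {}
    for proc, idx, value in sorted(requests):
        replies[proc] = cells[idx]
        cells[idx] += value
    return replies


def shortsort(keys):
    """Return the stable 0-based rank of each key; keys lie in range(len(keys))."""
    n = len(keys)
    count = [0] * n
    temp = [0]
    # Histogram: COUNT[v] becomes the number of processors holding key v.
    mp_add(count, [(i, keys[i], 1) for i in range(n)])
    # Prefix over COUNT: CUMULATIVE[v] = number of keys smaller than v.
    prefix = mp_add(temp, [(v, 0, count[v]) for v in range(n)])
    cumulative = [prefix[v] for v in range(n)]
    # Each processor claims the next free slot reserved for its key value.
    ranks = mp_add(cumulative, [(i, keys[i], 1) for i in range(n)])
    return [ranks[i] for i in range(n)]


keys = [3, 1, 3, 0, 1]
ranks = shortsort(keys)
print(ranks)
```

Equal keys receive consecutive ranks in processor order, which is the
stability property the iterated sort relies on.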

¹ This simple histogram computation cannot be done in a constant number of steps on
the scan model [4].

Because the multiprefix operation is ordered by processor indices, the simple
sort above is stable. We can iterate shortsort to sort larger keys in blocks. The
primitive operation LSBLOCK(w, j) below returns the j'th least significant block
of log N bits of location w, that is, bits (j - 1) log N + 1 through j log N.

SORT:

    RANK[*] := 0

    KEYPTR[*] := *    (initialize pointer to self)

    FOR j = 1 to k DO

        KEY[*] := LSBLOCK(KEYPTR[*], j)

        RANK[*] := SHORTSORT[KEY[*]]

        KEYPTR[RANK[*]] := KEYPTR[*]

    ENDDO
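
The loop above is a least-significant-block-first radix sort. A minimal
Python sketch of the same iteration follows, under stated assumptions: the
function stable_ranks is a stand-in for SHORTSORT built on Python's stable
sorted(), the sketch moves the keys themselves rather than pointers, and the
parameter bits plays the role of log N.

```python
def stable_ranks(digits):
    """Stand-in for SHORTSORT: stable 0-based rank of each digit."""
    order = sorted(range(len(digits)), key=lambda i: digits[i])  # stable sort
    ranks = [0] * len(digits)
    for position, i in enumerate(order):
        ranks[i] = position
    return ranks


def lsblock(word, j, bits):
    """Return the j-th least significant block (j = 1, 2, ...) of `bits` bits."""
    return (word >> ((j - 1) * bits)) & ((1 << bits) - 1)


def radix_sort(words, k, bits):
    """Sort keys of k * bits bits in k passes, least significant block first."""
    current = list(words)
    for j in range(1, k + 1):
        key = [lsblock(w, j, bits) for w in current]
        rank = stable_ranks(key)
        moved = [0] * len(current)
        for i, r in enumerate(rank):   # the step KEYPTR[RANK[i]] := KEYPTR[i]
            moved[r] = current[i]
        current = moved
    return current


words = [13, 2, 61, 8, 5, 2, 47, 0]
result = radix_sort(words, k=2, bits=3)
print(result)
```

Each pass preserves the order established by earlier passes among keys whose
current block is equal, which is why the per-block sort must be stable.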

2.3  Set  Operations 

Sets are a fundamental data abstraction. Traditionally, sets have not been
supported as primitive objects, but instead have been built on top of lower level
structures such as lists, arrays, trees and tables. The Fluent abstract machine includes
set operations as primitives:

• INSERT (x, S) Insert element x into set S.

• DELETE (x, S) Delete element x from set S.

• MEMBER? (x, S) Is x an element of the set S?

• APPLY (S, f) Apply the function f to the elements of set S. Note that f
may change the values of the elements in S.

• REDUCE (S, f) Evaluate f with arguments that are elements of S. Note
that f must be a binary associative operator.

In  addition,  set  union,  intersection,  difference,  prefix,  and  enumerate  are  also 
supported. 

Every Fluent processor can execute a set instruction, so that many sets can be
manipulated simultaneously. For example, several processors may simultaneously
insert elements, possibly into the same set. The result of concurrent set
operations is as if the individual instructions were executed atomically in some arbitrary
unspecified serial order. The implementation, however, is completely parallel and
provably efficient. Simultaneously updating multiple sets is costly on
existing machines.

3 Implementing Fluent Instructions

This  section  describes  how  the  Fluent  abstract  machine  is  implemented  on  the 
butterfly  network.  The  routing  algorithm  used  is  extremely  simple  and  provably 
efficient,  and  forms  the  basis  of  the  Fluent  machine  proposed  in  Section  5. 

3.1  The  Fluent  Network 

The nodes of the Fluent machine are interconnected in the butterfly (FFT) pattern.
There are 2^n nodes in each of n + 1 levels, for a total of N = (n + 1)2^n nodes.
Each node is labelled with a string (c, r) (0 ≤ c ≤ n, 0 ≤ r < 2^n) formed by
concatenating the binary representations of the level number c and the index r of
the node within the level. Each node (c, r) (c < n) is connected by forward links to
the nodes (c + 1, r) and (c + 1, r ⊕ 2^c), where ⊕ denotes bitwise exclusive or. Each
node (except those in levels 0 and n) thus has four connections: two connections to the
next higher level and two to the previous level.
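
The link rule can be written down directly. The Python sketch below is
illustrative only; the function names forward_links and node_count are
assumptions, and nodes are represented as (level, index) pairs.

```python
def forward_links(c, r, n):
    """Nodes at level c + 1 reachable from node (c, r) in a butterfly with n + 1 levels."""
    if not (0 <= c < n and 0 <= r < 2 ** n):
        raise ValueError("forward links exist only for 0 <= c < n and 0 <= r < 2**n")
    # Straight link keeps the index; cross link flips bit c of the index.
    return [(c + 1, r), (c + 1, r ^ (1 << c))]


def node_count(n):
    """Total number of nodes: 2**n nodes in each of n + 1 levels."""
    return (n + 1) * 2 ** n


print(forward_links(1, 0b101, 3))
print(node_count(3))
```

Because the cross link at level c flips exactly bit c, a path that crosses
all n levels can change every bit of the index once, so any index at one end
can be reached from any index at the other.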

Each  node  in  the  butterfly  contains  a  processor,  a  memory  module  and  6  routing 
switches.  Each  switch  has  2  inputs  and  2  outputs.  Every  input  into  a  switch  enters 
a  first-in  first-out  queue,  which  has  the  capacity  to  buffer  a  small  number  (2  or  3) 
of  messages  in  transit. 

3.2  The  Address  Map 

The  shared  variables  of  a  Fluent  program  are  distributed  among  the  local  memories 
of  the  nodes  using  an  appropriately  chosen  address  map.  If  the  Fluent  program 
does  not  involve  run-time  address  computation  then  the  physical  address  of  each 
shared  variable  can  be  embedded  within  the  program  of  each  processor.  Otherwise, 
we  must  compute  addresses  quickly  at  run  time. 

We propose to distribute the M shared variables randomly among the
processors, each processor being assigned M/N variables. With a random hash function,
memory bottlenecks are unlikely because the accessed variables will be distributed
throughout the network. Suppose that we have chosen such a hash function ℋ.*
This function maps a log M bit address to a log N bit node address. A second
function ℳ computes the address (log(M/N) bits) within the memory of node
ℋ(x). The physical address of shared variable x is given by the concatenation
(ℋ(x), ℳ(x)).
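
One way to picture the two-part physical address is sketched below in Python.
Several choices here are assumptions rather than the report's design: a
linear hash (a*x + b) mod M is used, the multiplier is taken coprime to M so
that the hash is a permutation, N is assumed to divide M, and the hashed
value is split by integer division into a node number and an offset.

```python
from math import gcd


def make_address_map(M, N, a, b):
    """Return a function mapping variable x in range(M) to (node, offset).

    Assumes N divides M and gcd(a, M) == 1, so that x -> (a*x + b) % M is a
    permutation and every node holds exactly M // N variables.
    """
    if M % N != 0 or gcd(a, M) != 1:
        raise ValueError("need N | M and gcd(a, M) == 1")
    per_node = M // N

    def physical(x):
        hashed = (a * x + b) % M
        # High part selects the node, low part the cell within that node.
        return divmod(hashed, per_node)

    return physical


physical = make_address_map(M=64, N=8, a=13, b=5)
addresses = [physical(x) for x in range(64)]
print(addresses[:4])
```

With these assumptions consecutive variables land on different nodes, which
is the property that keeps a regular access pattern from concentrating on
one memory module.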

* Our simulations show that simple first degree polynomials perform well in practice. A random
O(log N) degree polynomial provably works well [10,16].

Figure 1: Logical Network

3.3  Message  Structure  and  Path 

Suppose that processor (c, r) wishes to access variable x. It generates a REQUEST,
a message of the form (dest, type, data). The destination dest is (ℋ(x), ℳ(x)), the
physical address of variable x. The type field denotes the kind of access requested,
e.g. READ, WRITE, or MP. Other possible values include EOS or GHOST, which
are used internally by the communication algorithm as we will see shortly. The
REQUEST is injected into the network. It will reach node ℋ(x) and return with the
required data.

The path from node (c, r) to node ℋ(x) = (c', r') and back involves 6 phases
through the butterfly. Every other phase is a forward phase, and these are
interleaved with backward phases. Figure 1 shows the 6 phases.

In the first phase, the message issued at node (c, r) is directed to node (n, r).
In Phase 2, the message follows the unique (backward) path in the butterfly from
node (n, r) to node (0, r'). This path is determined at each switch by looking at the
appropriate bit of dest. In Phase 3, the message reaches the node (c', r'), where it
acquires the required data. The next 3 phases simply retrace the path traced thus
far, back to the source processor (c, r). The access is now complete.

For convenience, we describe the routing mechanism in terms of the logical
network of Figure 1 instead of the butterfly. The correspondence between the two
is clear and each butterfly node does the work of 6 switches in the logical network.


3.4  How  to  Combine  Messages 

At the heart of the Fluent machine lies the routing strategy [16]. The key idea is a
simple way of combining instructions that reference a common variable. Consider
the case when several processors READ a common variable. The paths of these
messages form a tree, as in Figure 2. Each message moves along the directed path
from its source to the destination.

Figure 2: Message paths to a common location form a tree
(Legend: requesting processors; module holding location; network link; message path)

There is, however, no need to send more than one request along any branch of
this tree. Each tree node forwards a request only when it "knows" that no future
incoming request will have the same destination. The key idea here is that each node
forwards requests in ascending order of destination addresses. Each node receives
messages along two incoming edges and places them into the corresponding FIFO
queues. At each step the node compares the destination addresses of the messages
at the heads of the two queues. The message with the smaller destination address is
transmitted forward. If both messages are destined for the same location, they are
combined and only one request is sent out. Finally, if only one queue has a message
waiting and the other queue is empty, no message is sent out. (If the message
were sent, the next message along the other edge could conceivably have a smaller
destination, thus violating the sorting requirement.)

In  our  snapshot  at  time  T,  node  A  in  Figure  3  selects  the  message  destined  for 
location  35.  Then  it  waits  until  the  message  to  location  48  arrives,  at  which  point 
it  discovers  that  the  messages  at  the  heads  of  both  the  queues  are  to  location  48, 
and  can  be  combined. 
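
The compare, forward and combine rule can be sketched as a single step
function. This Python fragment is a simplified illustration, not the switch
design: messages are reduced to bare destination addresses, ghost and
end-of-stream messages are left out, and the name switch_step is an
assumption.

```python
from collections import deque


def switch_step(top, bottom):
    """Perform one step of a combining switch.

    top and bottom are deques of destination addresses, each arriving in
    ascending order. Returns the destination forwarded, or None if the
    switch has to wait.
    """
    if not top or not bottom:
        # An empty queue might still deliver a smaller address, so wait.
        return None
    if top[0] == bottom[0]:
        bottom.popleft()          # combine: only one request goes forward
        return top.popleft()
    smaller = top if top[0] < bottom[0] else bottom
    return smaller.popleft()


top, bottom = deque([35]), deque([48])
first = switch_step(top, bottom)     # forwards the request for location 35
stalled = switch_step(top, bottom)   # top queue is empty, so the switch waits
top.append(48)                       # the request for location 48 arrives
combined = switch_step(top, bottom)  # both heads are 48 and are combined
print(first, stalled, combined)
```

The stalled step in the middle is the situation that ghost messages are
introduced to relieve.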

3.4.1  Reply  routing 

How do we return the data to all requesting processors? The reply message, upon
reading the data, returns backwards along each edge of the tree and reaches every
requesting processor. For the backrouting we only need to store two direction bits
at each node. The bits say whether the request came along the top branch, the
bottom one, or along both. Since messages are kept sorted throughout the six
phases, replies at each node arrive in the same order as the requests were sent out.
Therefore, the direction bits can be stored in a 2-bit wide FIFO queue. This simple
idea is more efficient than the associative memories proposed earlier [7].
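
A minimal Python sketch of the direction-bit queue follows. The class name
ReplySwitch, the bit encoding and the dictionary returned by reply are all
assumptions made for illustration; only the idea of a 2-bit FIFO entry per
forwarded request comes from the text.

```python
from collections import deque

TOP, BOTTOM = 0b10, 0b01   # assumed encoding of the two direction bits


class ReplySwitch:
    def __init__(self):
        self.directions = deque()   # one 2-bit entry per forwarded request

    def forward(self, from_top, from_bottom):
        """Record which input(s) the forwarded request arrived on."""
        bits = (TOP if from_top else 0) | (BOTTOM if from_bottom else 0)
        self.directions.append(bits)

    def reply(self, data):
        """Route a reply back; replies arrive in the order requests left."""
        bits = self.directions.popleft()
        return {"top": data if bits & TOP else None,
                "bottom": data if bits & BOTTOM else None}


switch = ReplySwitch()
switch.forward(from_top=True, from_bottom=False)   # uncombined request
switch.forward(from_top=True, from_bottom=True)    # combined request
first_reply = switch.reply("a")
second_reply = switch.reply("b")
print(first_reply, second_reply)
```

No address lookup is needed on the return path because the FIFO order of the
entries already matches the order of the replies.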


Figure  3:  Combining  Messages  by  Merging  Streams 


3.4.2  Ghost  messages 

The  simple  idea  of  keeping  message  streams  sorted  has  one  deficiency.  Consider 
Figure  3  again.  At  time  T,  processor  B  cannot  transmit  the  message  it  holds  for 
location  25,  because  it  does  not  know  what  will  arrive  on  the  top  link.  However, 
when  A  selects  the  message  to  location  35  for  transmission,  it  can  send  a  ghost 
message  labelled  35  to  B.  When  B  receives  the  ghost  message,  it  knows  that  future 
messages  along  that  edge  will  be  destined  for  locations  greater  than  35.  Therefore, 
at  the  next  time  step  B  can  forward  the  message  waiting  in  the  lower  queue. 

Ghost messages notify nodes of the minimum location to which subsequent
messages can be destined. Ghosts are not used for any other purpose; they "keep the
system fluent."

3.4.3  Flow  control 

It is possible that a switch S is ready to transmit a message forward but the input
queue for the next switch is full. When this happens, S simply retains the message and
tries in the next clock cycle. Of course, if the message S tried to transmit was a
ghost, it can be dropped without any loss of information.

Many routing algorithms which adopt such a holding policy give poor
performance because congested buffers back up buffers upstream. For our algorithm the
probability of such degradation is provably minuscule, and the algorithm is always
deadlock-free.

3.4.4  Termination 

Immediately following a request, each processor also issues an end-of-stream (EOS)
message. The dest field of every end-of-stream message is ∞. An EOS notifies a switch that
no more requests will follow. The switch can now safely forward the requests on
the other edge, and eventually forward the EOS messages themselves. EOS messages
form a wavefront which guarantees that every instruction will terminate.

3.4.5  Performance 

Following  Ranade  [16],  we  can  show  that  this  routing  algorithm  is  close  to  optimal. 

Theorem  2  Assuming  a  perfect  random  address  map,  the  probability  that  any 
memory  reference  takes  more  than  15  log  N  steps  is  less  than  N~^^. 

Every routing algorithm must take at least 4 log N steps. Observe that the
provable performance is within a constant factor (less than four) of this lower bound,
and considerably faster than previous algorithms for routing on butterflies of
reasonable size.

Figure 7 gives timing results from simulations of the routing algorithm. We
experimented with a number of different memory access patterns, e.g. matrix access,
trees of different types, shuffles, random permutations, etc. In no case was the time
taken more than 11 log N, even with queues of size 2. Increasing queue size did
not appreciably affect performance. We found that simple hash functions (shared
variable x mapped to physical address ax + b mod M) were satisfactory. Section
5.1 describes more simulation experiments.
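
A minimal sketch of such an address map follows. The constants a = 7, b = 3 and
M = 101 are illustrative assumptions only (the values used in the simulations are
not given); M is taken to be prime, as in Section 5.1, so that the map is a
bijection on 0..M-1 whenever a is not a multiple of M.

```python
def make_address_map(a, b, M):
    """Return the map x -> (a*x + b) mod M used to place shared variables."""
    def address(x):
        return (a * x + b) % M
    return address


# Illustrative constants only; they are not taken from the report.
address = make_address_map(a=7, b=3, M=101)

# With M prime and a not a multiple of M, distinct variables in 0..M-1
# are placed at distinct physical addresses.
placed = {address(x) for x in range(101)}
```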

3.5 Multiprefix instructions

We first describe the implementation for fetch-and-add proposed in [7]. Let s be
an arbitrary switch in phase 1 (or 2). Suppose that the messages at the heads of
the queues are m1 = (l, fetch-add, v1) and m2 = (l, fetch-add, v2) respectively. As
shown in [7] the switch must forward a message m = (l, fetch-add, v1 + v2) in place
of m1 and m2. If the reply to m is a value v, then the corresponding switch in phase
6 (or 5) returns v as a reply to m1, and v + v1 as a reply to m2. Thus the switch
must remember the value v1 received on its top queue for each pair of fetch-and-add
messages that it combines.
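
The combining rule can be sketched for a single switch as follows. This is an
illustration under simplifying assumptions, not the switch design itself: messages
are reduced to (location, value) pairs, replies are assumed to return in the order
in which combinations were made, and the names CombiningSwitch and
memory_fetch_add are hypothetical.

```python
from collections import deque


class CombiningSwitch:
    """Combines pairs of fetch-and-add requests to the same location."""

    def __init__(self):
        self.saved = deque()  # one remembered v1 per combination, FIFO order

    def combine(self, m1, m2):
        """m1 arrives on the top queue, m2 on the bottom; each is (location, value)."""
        (loc1, v1), (loc2, v2) = m1, m2
        assert loc1 == loc2, "only requests to the same location combine"
        self.saved.append(v1)
        return (loc1, v1 + v2)

    def split_reply(self, v):
        """Turn the reply v to the combined message into replies for m1 and m2."""
        v1 = self.saved.popleft()
        return v, v + v1


def memory_fetch_add(memory, message):
    """Apply a fetch-and-add at the memory module and return the old value."""
    loc, increment = message
    old = memory[loc]
    memory[loc] = old + increment
    return old


memory = {"x": 100}
switch = CombiningSwitch()
m = switch.combine(("x", 5), ("x", 7))
r1, r2 = switch.split_reply(memory_fetch_add(memory, m))
```

Here r1 = 100, r2 = 105 and the location ends at 112, exactly as if the request
with value 5 had executed before the request with value 7.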

Notice that this is equivalent to a serial execution of the message received on the
top input (m1) before the message received on the bottom input (m2). Thus if we
ensure that messages received on the top input always originate in a processor with
a smaller number than those received at the bottom input, we effectively have an
implementation for the multiprefix operation, with addition replaced by the prefix
operator. We show how to do this by numbering the processors appropriately.

Theorem 3 The multiprefix operation will, with overwhelming probability,
terminate in O(log N) steps.

Proof: We present the required numbering for the processors and switch inputs.
Processor (c, r) is numbered nr + c. A switch (c, r) in phases 1 or 2 receives its
inputs i0, i1 from switches (c − 1, r0) and (c − 1, r1) respectively. If r0 < r1, then
we shall label i0 as the top, else we label i1 as the top. ∎

Figure 4: Fetch-and-add (Phase 1(2) and Phase 6(5))

As noted earlier, the only extra requirement over a read instruction is that, in
addition to the two direction bits, each switch must remember a value (partial sum)
for every combination that occurs at that switch. Figure 5 shows a pair of switches
with the required queues.


3.6  Processor  synchronization 

"It  is  always  4  o’  clock  here,”  said  the  March  Hare  to  Alice. 

— Lewis  Carroll,  Alice  in  Wonderland 

We use EOS messages to implement a distributed global clock. Recall that one
EOS message per instruction passes through each switch. By maintaining a count of
the number of EOS messages that have passed through, each switch keeps its version
of the global time.

Different switches may indeed have different counts or versions of the global
time, but that is perfectly all right. If two instructions access a common location in
the same time step, then the one that arrives first will have to wait for the slower
one to reach an intermediate switch for combination. Because we keep messages
sorted by tag, and we guarantee that only one request for access will be passed
into the memory module which holds the variable, the effect is the same as if all
the processors were operating synchronously. For example, our implementation
guarantees that for the code of Figure 6 processors 1 and 2 will respectively read
10 and 20, provided no other processor writes a and b in the meantime. This is
guaranteed in spite of the fact that both processors might issue all 3 instructions
without waiting for any to complete. This is a very strong synchronization condition
requiring special primitives on most other programming models.

Figure 5: Internals of a pair of switches

Figure 6: Synchronization Guarantee

This implementation also allows each processor to stop the global clock if
necessary, for example if it detects an error. This is done by withholding the
end-of-stream message.
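
The clock mechanism can be sketched as a counter per switch. This is a toy model
under stated assumptions: a single processor feeds a single switch, message
contents are reduced to a type string, and the names Switch and issue_instruction
are hypothetical.

```python
class Switch:
    """Keeps a local version of global time by counting EOS messages."""

    def __init__(self):
        self.local_time = 0

    def receive(self, message_type):
        if message_type == "EOS":
            self.local_time += 1


def issue_instruction(switch, request_type, send_eos=True):
    """A processor sends one request, normally followed by its EOS."""
    switch.receive(request_type)
    if send_eos:
        switch.receive("EOS")


s = Switch()
issue_instruction(s, "READ")
issue_instruction(s, "WRITE")
issue_instruction(s, "READ", send_eos=False)  # error detected: EOS withheld
```

After the third instruction the switch still reads time 2: withholding the EOS
keeps the clock from advancing at every switch that waits for it.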

For  lack  of  space  we  will  not  describe  how  set  operations  are  implemented.  The 
interested  reader  is  referred  to  [15]. 

4  The  Routing  Switch 

In  this  section  we  outline  a  bit-serial  design  for  the  routing  switch  and  estimate  its 
layout  requirements.  The  design  extends  to  wider  data  paths  in  a  straightforward 
manner. 

Although Section 3 presented the routing algorithm with the implicit assumption
that messages were transmitted in atomic packets, this is not necessary. In
particular, each message can be transmitted bit-serially in a pipelined manner. This is
analogous to the wormhole router of Dally and Seitz [5]. Message transmission can
be pipelined because:

1. Address comparison can be done bit-serially, provided the addresses are
received most significant bit first.

2. Message combination can be done bit-serially; for operators like +, the data
must be transmitted least significant bit first. Also see on-line arithmetic [17].

3. When a message leaves a switch, the corresponding GHOST message (whose dest
is identical to the real message) can be generated bit-serially.
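
The first two points can be illustrated with a short sketch in which each output
bit depends only on the input bits received so far. The representation of
addresses and data as Python lists of bits, and all function names, are
assumptions made for the illustration.

```python
def bits_msb_first(x, width):
    """Bits of x, most significant first."""
    return [(x >> i) & 1 for i in reversed(range(width))]


def bits_lsb_first(x, width):
    """Bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(width)]


def from_lsb_first(bits):
    """Integer value of a bit list given least significant bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_min_msb_first(a_bits, b_bits):
    """Stream out the smaller of two equal-length addresses, MSB first.

    While the prefixes agree, either input bit may be emitted; at the first
    differing bit the input holding a 0 is the minimum and is followed from
    then on. The winner is therefore known only during the transmission.
    """
    out = []
    winner = None  # None while the prefixes are equal, then 'a' or 'b'
    for a, b in zip(a_bits, b_bits):
        if winner is None and a != b:
            winner = 'a' if a < b else 'b'
        out.append(a if winner in (None, 'a') else b)
    return out, winner


def serial_add_lsb_first(a_bits, b_bits):
    """Stream out a + b, LSB first, keeping only a one-bit carry."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry
        out.append(s & 1)
        carry = s >> 1
    out.append(carry)
    return out
```

The comparison needs the most significant bit first because the first differing
bit decides the order; the addition needs the least significant bit first
because carries propagate upward.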

Each message is transmitted with the dest field first (most significant bit
leading), followed by the type field, and finally the data field (least significant bit
leading). A switch begins operating when: (1) each input queue contains at least one
message, and (2) the input queues of the receiving switches are not full.

We  now  describe  the  operation  of  a  switch  in  phase  2.  Switches  in  other  phases 
can  be  specified  similarly. 

1. Transmit dest: The minimum of the destinations of the two messages in the
input queues is transmitted along both outputs. The minimum is discovered
only after the transmission, so till then both destinations must be retained in
the input queues.

2. Transmit type: While transmitting the destination, the switch detects which
output link the request must be routed on. This requires checking one fixed
bit in the dest field. The type of the message with the minimum destination is
transmitted on that output, while on the other, type GHOST is transmitted.

3. Transmit data: This is relevant for messages like MP-⊕ or WRITE. In either
case, the message type indicates how messages must be combined when
necessary. Again, the data fields can be combined and transmitted as they
arrive.
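
The three steps above can be sketched at the level of whole messages. This is a
simplified model under stated assumptions: messages are (dest, type, data)
tuples, both input queues are non-empty, combining of equal destinations is not
modelled (a tie is resolved in favour of the top input), and the function name
phase2_step and the bit parameter are hypothetical.

```python
from collections import deque

GHOST = "GHOST"


def phase2_step(top, bottom, bit):
    """One step of a simplified phase-2 switch.

    top, bottom: deques of (dest, type, data) messages, sorted by dest.
    bit: index of the fixed dest bit that selects the output (0 = LSB).
    Returns (out0, out1), the messages sent on the two outputs.
    """
    # The message with the minimum destination is the one forwarded.
    source = top if top[0][0] <= bottom[0][0] else bottom
    dest, mtype, data = source.popleft()
    real = (dest, mtype, data)
    ghost = (dest, GHOST, None)  # same dest, sent on the other output
    return (ghost, real) if (dest >> bit) & 1 else (real, ghost)


top = deque([(5, "READ", None)])
bottom = deque([(2, "WRITE", 9)])
out0, out1 = phase2_step(top, bottom, bit=1)
```

In this example the WRITE to destination 2 is forwarded on output 1 (bit 1 of
its dest is set), a GHOST with dest 2 goes out on output 0, and the READ to
destination 5 stays queued.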

The ability to pipeline messages speeds up message delivery considerably when
there are no queueing delays. The message delivery time reduces from (network
latency) × (message length) to (network latency) + (message length). We expect
the latency of each switch to be about 4 (message enters an input queue, passes
through the ALU, is sent to the output queue, and then transmitted), giving a total
latency of 4 × 6n for the logical network. Assuming 100 bit long messages and 4-bit
wide data paths, the time for a 13 dimensional butterfly is (4 × 6 × 13) + 100/4 = 337
steps.
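
The estimate can be reproduced with a few lines of arithmetic. The function name
and parameter names below are ours; the default values are the ones quoted in the
text.

```python
import math


def delivery_steps(dimension, message_bits, path_bits,
                   switch_latency=4, phases=6, pipelined=True):
    """Message delivery time in steps, ignoring queueing delays."""
    network_latency = switch_latency * phases * dimension
    message_length = math.ceil(message_bits / path_bits)
    if pipelined:
        return network_latency + message_length
    return network_latency * message_length


pipelined_steps = delivery_steps(13, 100, 4)                     # 312 + 25
unpipelined_steps = delivery_steps(13, 100, 4, pipelined=False)  # 312 * 25
```

With these figures pipelining gives 337 steps, against 7800 steps if each
message were forwarded whole at every switch.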

We now estimate the area requirements for the routing switches per node. Each
switch consists of message queues, an ALU (for address comparison, message
combination, etc.), counters to maintain the message FIFO queues, memory for storing
partial sums, and direction bits for reply routing. In the following we assume that
messages are 100 bits wide, and that partial sums are 64 bits wide.

Switches in phases 2 and 5 have two input queues, while others only have one
input queue. The total number of message queues per node is therefore 8.
Simulations (Section 5.1) indicate that for a 100,000-node machine each message queue
need hold only 3 messages. The total memory requirement for message queues thus
equals 8 × 3 × 100 = 2400 bits, or roughly 1.2 Mλ² (at 500 λ² per bit*).

Simulations also strongly indicate that no switch will ever transmit more than 40
messages along its outputs. For reply routing we need 2 bit wide direction queues,
and 64 bit wide partial sums. Long partial sum queues are maintained only in phase
2 so that the total memory requirement adds up to 40 × 64 + 6 × 2 × 40 = 3040
bits, or 1.52 Mλ².

Each queue requires 3 counters, except for the message queues which require
4. Assuming 8 bit wide counters, the total memory is 424 bits. With 3000 λ² per
counter bit, total area requirement is 1.28 Mλ².

Assuming 8 bit wide data paths, each ALU requires around 1.2 Mλ², for a total
of 7.2 Mλ² per node.

The total area requirement is thus approximately 11.2 Mλ². Including
miscellaneous overhead, 15 Mλ² is a conservative estimate for 6 switches per node.
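
The area budget can be tallied as follows. The constants are the figures quoted
above; the count of 6 ALUs per node is inferred from the 7.2 Mλ² total and the
6 switches per node, and the 424 counter bits are taken as quoted rather than
re-derived. The function and key names are ours.

```python
MEMORY_BIT_AREA = 500      # λ² per memory bit
COUNTER_BIT_AREA = 3000    # λ² per counter bit
ALU_AREA = 1.2e6           # λ² per ALU with 8-bit data paths


def switch_area_per_node():
    """Return the area components (in λ²) for the 6 switches of one node."""
    queue_bits = 8 * 3 * 100            # 8 queues, 3 messages, 100 bits each
    reply_bits = 40 * 64 + 6 * 2 * 40   # partial sums plus direction queues
    counter_bits = 424                  # figure quoted in the text
    return {
        "message queues": queue_bits * MEMORY_BIT_AREA,
        "reply state": reply_bits * MEMORY_BIT_AREA,
        "counters": counter_bits * COUNTER_BIT_AREA,
        "ALUs": 6 * ALU_AREA,           # assumed: one ALU per switch
    }


areas = switch_area_per_node()
total_mega_lambda2 = sum(areas.values()) / 1e6
```

The components sum to roughly 11.2 Mλ², dominated by the ALUs, which is why
narrower data paths or shared ALUs are the obvious place to trade area for time.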


5  The  Fluent  Machine 

This section presents an outline for a Fluent machine which can be constructed
within the next few years with conservative technology. Table 1 summarizes our
assumptions about the technology available. Needless to say, breakthroughs in
packaging technologies will have the largest impact.

*The estimates for the different components are from [12].

Feature size for VLSI          1 μ (λ = 0.5 μ)
Chip size                      100 mm² = 400 Mλ²
Pins per chip                  150
Printed circuit board size     .5 m × .5 m
Off board connections          512

Table 1: Technology for the Fluent-I

Switches                             11^0201
2 32-bit RISC Processors
Floating point unit                  100 Mλ²
128 Kbytes memory per processor
Total area requirement per chip

Table 2: Chip Specification

The Fluent-I is organized as a 13-dimensional butterfly, with 2^13 nodes in each
of 14 ranks for a total of 114,688 nodes. These nodes are divided into 256 boards,
each housing a 6-dimensional butterfly. The network is partitioned into 2 planes of
boards, arranged in the manner suggested by Wise [18]. Each board has 448 nodes,
divided among 224 chips, with 2 nodes per chip. In addition to the 2 processors, each
chip also has routing switches for the two nodes, one floating point unit (multiplier
and adder), and memory. Table 2 summarizes the breakdown of chip area, using
estimates as in the previous section.
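
The node, board and chip counts follow from the size of a butterfly, which has
(d + 1) ranks of 2^d nodes in dimension d. A short check, with names of our own
choosing:

```python
def butterfly_nodes(dimension):
    """Number of nodes in a butterfly: (dimension + 1) ranks of 2**dimension."""
    return (dimension + 1) * 2 ** dimension


total_nodes = butterfly_nodes(13)             # whole machine
nodes_per_board = butterfly_nodes(6)          # one 6-dimensional butterfly
boards = total_nodes // nodes_per_board
chips_per_board = nodes_per_board // 2        # 2 nodes per chip
floating_point_units = boards * chips_per_board  # one per chip
```

These reproduce the 114,688 nodes, 448 nodes per board, 256 boards and 224
chips per board quoted above, and give 57,344 floating point units.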

Data paths between nodes vary in width depending on whether the path is
on-board or across boards. Each board has 128 4-bit wide data paths out (64 nodes
in the last rank of a 6-dimensional butterfly, each with 2 forward links). On-board
paths are 8 bits wide. The butterfly can be partitioned so that each chip requires
16 data paths, so that 128 pin connections suffice.

This variation in data path widths was not considered in the previous sections.
The performance of the routing algorithm changes somewhat with narrow channels.
The off-board channels also have to be multiplexed over the 6 phases of the logical
network, while on-board channels are replicated. At worst one would estimate that
the 4-bit wide off-board channels would slow the system by a factor of 12 (a factor
of 6 for multiplexing the phases, times 2 because the other channels are 8 bits
wide), but our simulations show that this is wildly pessimistic.




Processors                 114,688
Floating Point Units       57,344
Memory                     16 Gbytes
Cycle time                 50 ns
Peak Floating Point Rate   2.3 Tflops

Table 3: Fluent-I Highlights

5.1 Simulation Results

We performed timing simulations of the communication network on the Connection
Machine. The objectives were to observe the sensitivity of the routing scheme to
variations in queue size, address maps, and memory reference patterns. A final
objective was to study the effect of using multiplexed, narrower off-board channels.

Our  conclusions  in  brief: 

1.  Simple  hash  functions  perform  well.  We  tried  various  linear  congruential 
maps:  variable  x  placed  in  location  ax  +  b  mod  M,  where  M,  the  size  of  the 
address  space,  is  a  prime,  and  a  and  b  are  constants. 

2. Routing time varies little with access pattern. We tried several patterns:
matrix access, binary trees, shuffle permutations, random accesses, etc. Random
patterns took slightly longer in all cases.

3. Concurrent access is faster than exclusive access. The extreme case is when
all processors read the same variable. The number of steps reduces from 154
(see Figure 7) to 85 because there is no buffering delay. This assumes that
messages are no wider than channels (see below for further discussion).

4. Queue size 3 is adequate. While queue size 1 degrades performance
drastically, queue sizes 2 or more give similar performance.

5. Figure 7 plots the routing time when off-board channels are multiplexed, with
narrowness being the ratio of the width of the off-board channels to the
on-board channels. Switches in lower phases are given higher priority in accessing
channels. Each channel first allows phase 0 messages to pass, followed by phase
1 messages, and so on. From the plot we can conclude that the performance
degrades by a factor of 1.7 over the ideal case (no narrow channels and no
multiplexing). The time goes up from 154 steps as in Figure 7 to about 260
steps (extrapolated for 114,688 processors from Figure 7).

5.2 Router performance

Suppose that messages are 100 bits long (64 data bits, 32 address bits, and 4 type
bits). If every channel were 8 bits wide, sending a message across one link would
require 100/8 ≈ 13 steps. From the results of the previous section we can therefore
estimate that, with narrow channels and multiplexing, an arbitrary permutation
can be routed in 260 × 13 = 3380 steps. With a 50 nanosecond clock cycle, the time
is about 169 μsec.

If all processors access a single variable, then the time is just 337 cycles (Section
4), or about 17 μsec.
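
Both timings are simple products of the figures already given; the sketch below
only makes the arithmetic explicit. Function and parameter names are ours.

```python
import math


def permutation_time(routing_steps=260, message_bits=100,
                     channel_bits=8, clock_ns=50):
    """Return (cycles, microseconds) to route an arbitrary permutation."""
    steps_per_link = math.ceil(message_bits / channel_bits)
    cycles = routing_steps * steps_per_link
    return cycles, cycles * clock_ns / 1000


def single_variable_time(cycles=337, clock_ns=50):
    """Microseconds when all processors access one variable (pipelined case)."""
    return cycles * clock_ns / 1000


perm_cycles, perm_us = permutation_time()
single_us = single_variable_time()
```

The permutation case comes to 3380 cycles, or 169 μsec, and the single-variable
case to 16.85 μsec, which rounds to the 17 μsec quoted.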

As  an  example,  suppose  that  we  wish  to  sort  16-bit  numbers,  with  32K  numbers 
in  each  node.  There  are  roughly  3.5  billion  numbers  being  sorted.  On  the  Fluent 
machine,  we  only  need  one  iteration  of  the  procedure  SHORTSORT  from  Section 
2.  For  each  number  being  sorted,  3  shared  memory  instructions  are  executed  (the 
others  are  local).  However,  the  instructions  can  be  packed  into  50  bit  messages. 
The  total  number  of  steps  required  is  3  x  32K  x  3380/2,  or  about  169  million  for 
a  total  time  of  8.5  seconds.  If  the  numbers  are  32  bits  long,  the  time  is  about  17 
seconds.  Note  that  this  is  the  time  to  sort  the  entire  contents  of  memory. 

5.3  Structured  Computations 

Much work has been done on mapping structured computations onto butterfly
networks. These computations do not need the generality of shared memory. Better
performance can be achieved by direct nearest neighbor communication rather than
routing. This allows us to utilize the floating point capabilities of the machine more
efficiently.

Table 4 presents performance estimates for two structured problems: FFT and
Matrix  multiplication.  We  considered  a  2’°  point  complex  FFT,  and  used  the 
standard  mapping.  We  obtain  a  performance  of  between  1.2  Tflops  and  2  Tflops 
depending upon the assumptions made about local memory bandwidth. Batcher's
bitonic sort (N numbers) on the butterfly takes 2 log² N steps. With 32K 16-bit
numbers per node, and each communication step requiring 4 cycles, the total time
is 4 × 2 × 289 × 32K ≈ 75M cycles. At a 50 ns clock this gives a time of 3.7 seconds.
While this estimate is lower than that of the shared-memory radix sort, extracting
the extra performance requires non-trivial and tedious low level fine timing.
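
The bitonic sort estimate can be checked in the same way. The value log N = 17
is the one implied by the factor 289 in the text; the function and parameter
names are ours.

```python
def bitonic_sort_estimate(log_n=17, numbers_per_node=32 * 1024,
                          cycles_per_step=4, clock_ns=50):
    """Return (cycles, seconds) for the bitonic sort estimate."""
    steps = 2 * log_n ** 2                      # 2 log^2 N communication steps
    cycles = cycles_per_step * steps * numbers_per_node
    return cycles, cycles * clock_ns * 1e-9


sort_cycles, sort_seconds = bitonic_sort_estimate()
```

The product is 75,759,616 cycles, i.e. the 75M cycles of the text, and the time
agrees with the quoted figure up to rounding.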

Besides nearest neighbor communication, performance gains can also be achieved
by partitioning structured problems into blocks, and doing block computations
locally within each node. This reduces the number of shared memory instructions.
For matrix multiplication there are no good mappings into the butterfly [3].
Instead, we partition a large matrix into block submatrices, each of which is stored
in one node. Instead of mapping blocks randomly to nodes, we use a simple
hierarchical approach: decompose the matrix into large blocks, and map these into
random boards. Next, decompose the large blocks into smaller blocks and map
them randomly into nodes. This allows us to exploit locality at the processor and
board levels, and reduces the communication load on the off-board channels.

Estimated multiprefix time                     169 μsec
Radix Sort, 3.5 · 10^9 16-bit numbers          8.5 sec
Bitonic Sort, 3.5 · 10^9 16-bit numbers        3.7 sec
Matrix multiplication                          0.8 Tflops
FFT                                            1.2 Tflops

Table 4: Fluent-I Performance


6  Conclusions  and  Extensions 

Powerful models of parallel computation need be neither expensive nor slow; this
is what we wish to demonstrate by building a Fluent parallel computer. In this
extended abstract we have presented the Fluent abstract machine, which is more
powerful than any other abstract shared memory model, and shown that it can be
implemented inexpensively on the Fluent machine.

We are continuing simulation experiments. By programming different
applications we hope to get more insight into the expressive power of the Fluent
programming model. We also expect to identify various tradeoffs, and adjust design
parameters accordingly. For example, by providing even wider data paths on board,
at the expense of reducing the number of switches per node (by multiplexing them),
we expect that overall performance can be improved.

In this abstract we have not considered many issues in processor/chip design,
and have mostly presented very conservative estimates for area requirements. We
expect to begin detailed design of the router and communications hardware following
our experiences with the simulator. Future work will throw more light on issues such
as SIMD vs. MIMD organization, processor complexity/wordlength, and operating
system issues.